Comparison and fusion prediction model for lung adenocarcinoma with micropapillary and solid pattern using clinicoradiographic, radiomics and deep learning features

To investigate whether the combination scheme of deep learning score (DL-score) and radiomics can improve preoperative diagnosis in the presence of micropapillary/solid (MPP/SOL) patterns in lung adenocarcinoma (ADC). A retrospective cohort of 514 confirmed pathologically lung ADC in 512 patients after surgery was enrolled. The clinicoradiographic model (model 1) and radiomics model (model 2) were developed with logistic regression. The deep learning model (model 3) was constructed based on the deep learning score (DL-score). The combine model (model 4) was based on DL-score and R-score and clinicoradiographic variables. The performance of these models was evaluated with area under the receiver operating characteristic curve (AUC) and compared using DeLong's test internally and externally. The prediction nomogram was plotted, and clinical utility depicted with decision curve. The performance of model 1, model 2, model 3 and model 4 was supported by AUCs of 0.848, 0.896, 0.906, 0.921 in the Internal validation set, that of 0.700, 0.801, 0.730, 0.827 in external validation set, respectively. These models existed statistical significance in internal validation (model 4 vs model 3, P = 0.016; model 4 vs model 1, P = 0.009, respectively) and external validation (model 4 vs model 2, P = 0.036; model 4 vs model 3, P = 0.047; model 4 vs model 1, P = 0.016, respectively). The decision curve analysis (DCA) demonstrated that model 4 predicting the lung ADC with MPP/SOL structure would be more beneficial than the model 1and model 3 but comparable with the model 2. The combined model can improve preoperative diagnosis in the presence of MPP/SOL pattern in lung ADC in clinical practice.


Materials and methods
Patients. This respective multicohort study was approved by the Institutional Review Board of The First Affiliated Hospital of Nanjing Medical University (Permit Number: 2021-SRFA-238) and the Institutional Review Board of the Affiliated Huaian NO.1 People's Hospital of Nanjing Medical University (Permit Number: 2022-0451-01), respectively. And the requirement of written informed consent was waived because that all data sources applied (demography, laboratory records, and chest CT) were previously available and analyzed anonymously and de-identified using study identifier before reading by the radiologists, model development, internally and externally validation. The need to obtain informed consent from all participants was waived by the Institutional Review Board of The First Affiliated Hospital of Nanjing Medical University and the Institutional Review Board of The Affiliated Huaian NO.1 People's Hospital of Nanjing Medical University. All methods were performed in accordance with the relevant guidelines and regulations.
The study reviewed 2567 cases undergoing CT scans for preoperative assessment from April 2016 to October 2019. All cases of lung ADC were histologically proven and recorded in our hospital database. The inclusion criteria were as follows: (a) no prior history of other cancer; (b) no prior history of radiation therapy or chemotherapy before chest surgery; (c) pathologically confirmed to be ADC and without any variant subtypes (colloid, mucinous, fatal adenocarcinoma, etc.); (d) CT images with thin section (1.5/1.0 mm) quality were adequate for analysis; (e) clinical and imaging data for this study were obtained from the medical records database; (f) the patients with tumor was no more than stage III A. Patients were excluded for one of the following reasons: (a) no preoperative CT scan at our institution (n = 154); (b) distal metastasis (n = 4); (c) history of radiation and chemotherapy before scanning (n = 19); (d) unsatisfactory imaging quality due to respiratory artifact during examination (n = 60); (e) insufficient laboratory examination data (n = 296), (f) minimally invasive adenocarcinoma (n = 648).
Finally, of 514 pulmonary lesions were recorded in 512 patients (males: females, 228:284; mean age ± standard deviation SD, 59.3 ± 10.1 years; range 26-82 years) in our institution. Of these, two lesions were detected in 2 and one in 510 patients. The derivation cohort had 360 cases, including 134 MPP/SOL positive and 226 MPP/SOL negative. Additionally, 154 cases assigned as the independent internal validation cohort were consisted of 57 MPP/SOL positive and 97 MPP/SOL negative. We attempted to search data for external validation on the public dataset collected in TCIA (https:// www. cance rimag ingar chive. net/), but mainly because of the lack of concrete pathological results. Therefore, we found another 101 cases (male: female, 48:53; mean age, 60.7 ± 9.4 years; range 31-75 years) from another hospital as the external validation cohort. Workflow of our study is shown in Fig. 1 Imaging protocols. Imaging acquisition was performed at our institution using an unenhanced CT scanner with 64-slice detector (SIEMENS SOMATOM Force; SIEMENS Definition AS+; GE MEDICAL SYSTEMS Revolution) and 128-slice detector (Philips iCT 256). All CT scans were performed in the supine position from the thoracic inlet to the margin of the kidney. All image protocol shared the following parameters: slice thickness,1.5 mm; tube voltage, 100-120 kVp; tube current, 80-300 mAs; matrix size, 256 × 256; and field of view,  Establishing the clinicoradiographic model. The CT images were evaluated independently by two radiologists (author #1 and author #4, 3 and 8 years of experience in thoracic imaging diagnosis, respectively). High-resolution CT (HRCT) images (qualitative and quantitative) using to evaluate morphological features. The morphological features such as maximum tumor diameter consolidation/tumor diameter (C/T) ratio calculated by twice decoding, density (pure ground glass, mixed ground glass or solid), margin (lobulation and spiculation: absent or present), lung-tumor interface(clear or blur); internal manifestation (cavity, calcification, and honeycomb sign: absent or present), pleural tag and indentation (absent or present) 19 , change in the vessels (absent or present), relation to bronchus (absent or present). The term "change in the vessel" was defined as the formation of a collection of blood vessels adjacent to the tumor, and multi-planar reconstruction (MPR) was used to determine whether the change was present. According to Qiang et al. 20 , the tumor-bronchus relationship assessment criteria have been modified to include visual evaluation of a positive manifestation of tumor-bronchus as bronchus extension into the lesion with a tapered lumen and interruption, or bronchus abrupt obstruction on the margin of the lesion. The clinic data involves gender, age, smoking history, family history and serum biomarkers including carcinoembryonic antigen (CEA) value, neuro-specific enolase (NSE) value, cytokeratin fragment 21-1(CYFRA21-1) value, which are recorded and categorized according to the level of 4.70 ng/ml, 16.30 ng/ ml and 3.30 ng/ml, respectively. Kappa values and intra-class correlation coefficients (ICCs) were calculated to assess the consistency of the two authors′ radiology evaluations. Univariate analyses were used to determine the differences between MPP/SOL negative and positive groups on the derivation set. Logistic regression analysis was adopted to build clinicoradiographic model.

Radiomics model built.
The radiomics model was built including four steps, that is, volumes of interest (VOI) segmentation, radiomics features extraction, feature selection and radiomics signature establishment and assessment. The VOIs encompass the entire tumor information. Semi-automatic contouring was performed in thin-section (1.0/1.5 mm) CT images with in-house software (MULTILABEL; ECNU, Shanghai, China). The semiautomatic identification of the VOI of the lesions relies on radiologists to locate the lesion, and then were implemented with a cooperation of semi-automatic segmentation thresholding algorithm and a manual adjustment approach of delineation on every section of the CT scans, containing two major steps as reported in forepassed literatures 15 . The initial segmentation is followed by the step of removal of surrounding vessel, bronchus, and calcification. The radiologists with 3 years of thoracic diagnosis experience (authors #1) and another one with 8 years of experience (author#4) blinded to experimental design reviewed the image and annotated VOIs avoiding necrosis, calcification, vascular structure, etc 8 . Once the true tumor boundary cannot be deduced precisely from the image, another radiologist with 10 years of experience in chest CT interpretation reviewed and confirmed the delineation of the lesion. The disagreements would be resolved by the observer′s investigation and comprehensive assessment. To ensure the stability of the radiomics feature extraction, the VOI of each lesion was drawn twice by each of two independent radiologists. The radiologists (author#1) annotated the VOIs of 60 cases randomly selected from the study after 3 months. The intra-and inter-class correlation coefficients were calculated after the segmentation. VOI segmentation information was converted to the NII format, followed by the features extraction with the aid of FeAture Explorer Pro (FAE Pro, V0. 3.7, (https:// github. com/ salan 668/ FAE. git) on Python (3.7.6) 21 . The process of image clip is making that pixel values less than 5% and more than 95% are controlled at 5% and 95% respectively to remove the influence of extreme pixel points. The extracted radiomic features were normalized to a standard unit using the following equation: f(x) = 1000 * (χ − µχ)/σχ, where µχ and σχ denote the mean and standard deviation of the image intensity, respectively. Expounded description regarding the initial settings used in FeAture Explorer Pro for the feature extraction process are provided in Supplementary Appendix 1. Image types includes original image and extract feature categories. Eventually, of 107 radiomics features extracted from the 3D-dimensional region were composed of three types, that is, shape features (number of features [m] = 18), first-order features (m = 14) and texture features (m = 75). Texture features include 24 Gy level co-occurrence matrix (GLCM) features, 16 Gy level run length matrix (GLRLM) features, 16 Gy level size zone matrix (GLSZM) features, 5 neighborhood gray tone difference matrix (NGTDM) features, 14 Gy level dependency matrix (GLDM) features. The intra-class correlation coefficients (ICCs) of the features were calculated to evaluate the inter-observer reproducibility and the features with ICC > 0.80 were enrolled in subsequent analysis. The derivation and internal validation sets are split in a ratio of 7:3. We up-samples by repeating random cases to balance the samples of micropapillary and solid negative and positive. The L2 norm was computed and divided by each feature vector. The feature vector was then mapped into a unit vector. We examined the similarity of each feature pair and eliminated one if its PCC (Pearson Correlation Coefficient) value which was greater than 0.99 to reduce the dimension of the feature space. We used recursive feature elimination (RFE) to select radiomics features based on a classifier by repeatedly considering a smaller set of features. Analysis of variance (ANOVA) was used to investigate the significant features associated with the labels. We sorted features according to their corresponding feature value (F-value), which were calculated to determine the relationship between features and labels, and selected a specific number of features to build the optimal integrated model. To identify predictive features in the model, we used logistic regression with the LASSO (Least absolute shrinkage and selection operator). The final lost function was augmented with the L1 norm, and the weights were constrained. The radiomics models′ hyper-parameters were based on the model's performance on www.nature.com/scientificreports/ the internal validation data set. Figure S1A-E depicts the automated segmentation process, features′ reproducibility analysis, feature selection and model development.
The radiomics model derived from the Least absolute shrinkage and selection operator (LASSO) procedure using fivefold cross validation in the derivation set without wavelet features.
Deep learning network architecture implementation. In this study, the deep learning model was developed using a derivation set based on the "wide residual network" (WRN) architecture 22 , The architecture of WRN50 is consisted of 50 convolutional layers with 4 max-pooling layers. We fed 2.5D patches with a size of 3 × 3 × 64 pixels into the neural network. Each patch was cropped to contain the maximum area of the nodule's slice. Three adjacent slices were extracted and concatenated to form a 3-channel patch ( Fig. 2A). The chest images were used as input data, deep learning score (DL-score) is the output of the network which was transferred by sigmoid function as the probability of being MPP/SOL negative or MPP/SOL positive subtype. All annotated CT images were used as input material to create a heat map illustrating a sensitive region of interest (ROI), which showed the regions with the greatest impact on the final prediction layer of the input images 23 . We used data augmentation techniques to suppress the problem of overfitting in derivation phase. Stochastic gradient descent (SGD) optimizer was employed to train the network with binary cross-entropy loss. Five-cross validation was used to evaluate the quality of proposed model and avoid over fitting and under fitting. PyTorch (version 1.6.0; https:// pytor ch. org) was used to implement the algorithm. A gradient-weighted activation mapping (Grad-CAM) 15,23 is a commonly used method in computer vision field to provide interpretability. In this work, we used Grad-CAM to visualize the important regions of the input image data (Fig. 2B). Higher value indicated the more indicative CT areas for associated prediction-making. The Grad-CAM algorithm was implemented by our deep learning framework "Strix" (https:// github. com/ Proje ct-Strix/ Strix) constructed in PyTorch (version 1.6.0; https:// pytor ch. org). The output of the final dense layer was our deep learning signature, which was transferred with the sigmoid function as the DL-score and built DL model.

Combined model built.
Logistic regression analysis was conducted to develop the model 4 with the incorporation of the selected meaningful clinical features, DL-score and R-score to construct the combined model on the derivation set. Figure 3 illustrates the steps involved in our study, including clinicoradiographic features selection, segmentation of CT images, feature extraction and selection from radiomics, deep learning network construction, combined model construction and model analysis.
Clinical utility and prediction nomogram. We internally assessed the models' performance in independent internal validation data using the receiver operating characteristic (ROC) curve analysis. The prediction nomogram was plotted based on logistic regression based on the combined model. Consistency between the nomogram was implemented with Hosmer-Lemeshow goodness-of-fit (GOF) test using calibration curves via bootstrapping with 1000 resamples 24 . The clinical usefulness was estimated with decision curve analysis and with the visualization of decision curve and clinical impact curve 25 . The illustrated WRN50 architecture was presented in our study which is conducted by PyTorch (version 1.6.0; http:// pytor ch. org). (B) From the left, the first column represents 2.5D patches with a size of 3 × 3 × 64 pixels that is locally exhibited the tumor and the second column reveals the activation heat maps. Last column makes a good visual reference for the prediction probability of MPP/SOL negative/positive, that is DL-score, which is observed by the importance of each part of tumor generated by ("Strix" (https:// github. com/ Proje ct-Strix/ Strix)). WRN wide residual network; DL-score deep learning score.

Results
Supplementary Table S1 presents that there were no statistically significant differences in clinic-radiographic characteristics between the derivation and internal validation sets (P all > 0.05). Table 1 shows the contribution of clinicoradiographic, radiomics, and DL-score features along with their associated coefficients in different models.

Model-development.
Model 1: clinicoradiographic features. The consistency analysis of radiographic variables between two radiologists are listed in Table 2 (Fig. 4A-C). The cut-off value of R-score is 0.582 chosen to maximize the Youden index of the ROC analysis from the derivation set.  (Fig. 4D-F). The cut-off value from derivation set is 0.656 chosen to maximize the Youden index of the ROC analysis.
Model 4: combined model. The features′ dimensionality was reduced from 107 to 16 features. And non-zero coefficients of the finally calculated as R-score combined with clinicoradiologic variables and DL-score in derivation set revealed that serum CEA (≥ 4.70 ng/ml), DL-score and R-score were independent predictor for MPP/SOL positive lesions based on logistic regression. The calculation formula = 3.722*DL-score + 2.405*R-score + 1.966*serum CEA level (≥ 4.70 ng/ml) + (− 4.173).  The value of DL-score is also obviously higher in patients of lung ADC with MPP/SOL positive in the derivation set (D), the Internal validation set (E) and external validation set (F) (p all < 0.001). R-score = radiomics score; DL-score = deep learning score.  A prediction nomogram building. The prediction nomogram was plotted based on logistic regression on the basis of the combined model, which is added significant incremental performance to the clinical model (Fig. 6A). Favorable calibration of the nomogram corroborated both in the derivation, internal validation set (Fig. 6B) and external validation set (Fig. 6B) demonstrated good calibration with p value of 0.921, 0.339 and 0.205, in the derivation, internal and external validation set in the Hosmer and Lemeshow goodness of fit (GOF) test, respectively.
Clinical practice. Decision curve analysis (DCA) revealed that indicating that the radiomics model, DL model or the combined model achieved moderately better net benefits than the clinical and radiographic factors or DL -score alone to predict lung ADC with MPP/SOL positive at threshold probability (5-80%) and comparable better net benefits to the radiomics model in internal validation set (Fig. 7A). If the threshold probability of a patient in a range of (48-52%) and (56-88%), the combined model used to predict positive lesions would be more beneficial than DL model and clinicoradiographic model (Fig. 7B). Clinical impact curve (Fig. S2A,B) shows the estimated number who would be affirmed high risk for each risk threshold and visually shows the proportion of those true positive cases 24.  www.nature.com/scientificreports/

Discussion
To date, lots of study has developed invasive preoperative model for assessment of lung ADC with MPP/SOL patterns. Our study extended previous work [7][8][9][10]27,28 by decoding the phenotypes of lung ADC using clinical, quantitative radiomics features and DL-score based scheme model separately and combined, attempting to determine whether an integrated strategy incorporating clinical, radiomics, and DL-score features could be used to discriminate lung ADC with MPP/SOL pattern for a more effective surgical approach or subsequent treatment scheme. The proposed combined model with a sensitivity of 91.2% and 83.9%, accuracy of 84.7%, 75.2%, AUC of 0.921 and 0.827 in internal and external validation, respectively, was significantly superior to deep learning and clinicoradiographic model alone (P < 0.05). Firstly, we developed a clinicoradiographic model (model 1) to classify lung ADC with or without MPP/SOL pattern using clinical, imaging, and histopathological profiling data from 625 lung ADC (internal dataset: external dataset; 514:101). Morphological CT features, such as lobulation, spiculation and maximum -tumor diameter all contribute to the differential diagnosis of MPP/SOL negative and positive. Our study revealed that the more lobulation, spiculation and bigger maximum-tumor diameter observed in lung ADC, the higher the probability that the lesion is a lung ADC with MPP/SOL pattern, which is comparably consistent with the findings of Park and his colleagues 13 that solid-predominant adenocarcinomas are more likely to exhibit spiculation or lobulation, which could be explained by enrolment bias. Clinical features such as serum biomarkers are widely believed to be a useful blood-based diagnosis for lung adenocarcinoma 29 , particularly the serum CEA level. The findings indicated that abnormal serum CEA levels are more prevalent in lung ADC with MPP/SOL positive than in MPP/SOL negative, but further study has more samples need to confirm it in future. Several studies 11,15 focused on measuring the C/T ratio as the main morphological predictor on invasiveness histological lesions in lung ADC, especially in part-solid nodules. The C/T ratio calculated in our study showed no statistical significance in derivation and internal validation set. It may be resulting from the limited various density nodules sample size, even though we attempt to randomly distribute the dataset with the aid of machine operation. Chen et al. 15 investigated the C/T ratio discernibility of lung ADC with MPP/SOL with AUC of 0.850, which is equal to the detection ability of radiomics model. Regarding the result of performance of C/T ratio, we considered patient selection bias to be the reason, that is the case of lung ADC with MPP/SOL is relatively fewer than another lung ADC pattern mainly featuring part-solid density.
Secondly, we compared to models built using DL model (model 3) and radiomics model (model 2) separately, the model 3 produced equivalent performance (P = 0.320). It was concluded that a DL model was feasible for classifying lung ADC with MPP/SOL pattern. However, the classification performance of the model 3 was inferior to that of model 2 in the validation set (AUC 0.906 vs 0.887), but superior in the derivation set (AUC 0.906 vs 0.896), suggesting that overfitting might occurred during the derivation process 14. And previous study 30 pointed that the opaqueness of deep learning results from implicit feature engineering or modeling makes its application a little difficult in clinic practice. In contrast, the radiomics features may be more feasible considering the standardized radiomics process, platform of tumor segmentation and model development 31 .
As we all known, the combination of radiomics features into R-score value can simplify the analysis flow. Finally, we attempted to combine R-score and DL-score with selected clinical findings to establish combined models to determine the predictive power and clinical relevance of lung ADC subtype discrimination. We discovered that model 4 had the best performance in preoperatively diagnosing MPP/SOL pattern in lung ADC patients with an AUC (0.921) in the internal validation set outperforming the model 3 and model 1, however, achieving a halfway decent result compared to model 2. We hypothesized that the radiomics model had sufficient signature for prediction and that additional variables could not significantly improve the discriminating capacity. The model 2 obtained a high level of accuracy (88.3%, 84.7%) in internal validation and derivation datasets, slightly higher than the model 3 (84.1%, 85.7%) and model 4 (84.7%, 86.1%). However, when compared externally, the mode 4 yielded the same accuracy with the model 3 (75.2%), which is little higher than model 2 (74.3%). The most likely explanation for this phenomenon is insufficient samples and human intervention during the deep learning network derivation process 29 . The combined model is superior to another 3 model alone in external validation set (P all < 0.05). We hypothesized that the result indirectly reflected the classification performance of the DL model and associated fusion model in lung ADC, which may be affected by the overfitting problem during deep learning model's data processing. Our study demonstrates that combining CT-based radiomics and DL-score with clinical implications would be another method for preoperative diagnosis of lung ADC with MPP/ SOL pattern, but it may not be prudent to combine R-score, DL-score and clinical features. Therefore, further research and development of novel fusion methods for fusing various types of features will need to be conducted.
As regards the issue of batch effect due to the utility of different CT devices, reported in some study 33,34 , we benefit form Qu et al. 34 study, firstly adressing the principal component analysis (PCA) on the selected radiomics features to detect the batch effect. The results elucidated that no significant batch effects ware observed among the data obtained by different CT scanners (shown in Fig. S3). As a result, statistical harmonization methods such as ComBat was not subsequently implemented to calibrate the data.
This study has the following limitations. First, the opaque nature of deep learning mechanism may be a source of contention 28 , as does the information-fusion strategy. Second, we did not pay enough attention in our investigation to the component of MPP and SOL structure, and lung ADC with multiple subtypes may be possess typical information of hybrid subtypes rather than that of certain subtypes, thus limiting the subtype discriminability 15 and producing some choice bias. Although we examined the models' performance internally and externally, the performance of the clinicoradiographic model, radiomics model and DL model is similar with respect to the prediction of the lung ADC with MPP/SOL pattern, a larger prospective dataset may be urgent to affirm these models' performance 30 . Additional samples and data processing ways such as mixup, cutout, and cutmix are required to confirm the impact of deep learning on lung ADC subtypes detection 32 . www.nature.com/scientificreports/ Finally, human-craft features, such as segmentation reproducible, manual outlining determination accuracy 35 , results floating with the variation of the initial segmentation. Therefore, additional work is required, including the collection of more representative data, searching for more feasible image visualization platform, attempt to other algorithm for image segmentation and more features such as wavelet features need to be further discussed.

Conclusion
In summary, we have established a combined model based on the DL-score, R-score and clinicoradiographic features to distinguish lung ADC with MPP/SOL structure. In comparison to the clinicoradiographic, radiomics features and DL-score-based scheme model separately, the new combined model enhanced the classification performance of clinicoradiologic features and deep learning model. So, the combined model may assist radiologists in differentiating lung ADC with MPP/SOL pattern. The results are still warranted to be certified in more further studies.

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.